First calculated in the mid-90’s, graduation rate keeps record of full-time, first-time students who start college and graduate 4 years later. If graduation rates at an institution are low, it can be inferred that students do not get the academic support they need to succeed, they are disappointed by the faculty or staff, or that they find life at the school unaffordable. Due to the significant financial impact that secondary education has on the economy the government created a College Scorecard to provide reliable and unbiased information about college performance’ (1). It is pertinent for institutions to measure and improve their graduation rate as the number impacts the amount of performance based funding awarded to them by the government. In this analysis, we evaluate and assess the factors that most significantly influence an institution’s graduation rate for the years 2009- 2016 (2).
First step was to read the data into R by using sapply(), and passing in the files and the read_csv function.
# files <- dir(path = "~/Documents/Data_munging_project/CollegeScorecard_Raw_Data")
# data <- sapply(files, read_csv, simplify=FALSE)
We then added a year column to each file.
# data$MERGED1996_97_PP.csv$YEAR <- 1996
# data$MERGED1997_98_PP.csv$YEAR <- 1997
# data$MERGED1998_99_PP.csv$YEAR <- 1998
# data$MERGED1999_00_PP.csv$YEAR <- 1999
# data$MERGED2000_01_PP.csv$YEAR <- 2000
# data$MERGED2001_02_PP.csv$YEAR <- 2001
# data$MERGED2002_03_PP.csv$YEAR <- 2002
# data$MERGED2003_04_PP.csv$YEAR <- 2003
# data$MERGED2004_05_PP.csv$YEAR <- 2004
# data$MERGED2005_06_PP.csv$YEAR <- 2005
# data$MERGED2006_07_PP.csv$YEAR <- 2006
# data$MERGED2007_08_PP.csv$YEAR <- 2007
# data$MERGED2008_09_PP.csv$YEAR <- 2008
# data$MERGED2009_10_PP.csv$YEAR <- 2009
# data$MERGED2010_11_PP.csv$YEAR <- 2010
# data$MERGED2011_12_PP.csv$YEAR <- 2011
# data$MERGED2012_13_PP.csv$YEAR <- 2012
# data$MERGED2013_14_PP.csv$YEAR <- 2013
# data$MERGED2014_15_PP.csv$YEAR <- 2014
# data$MERGED2015_16_PP.csv$YEAR <- 2015
# data$MERGED2016_17_PP.csv$YEAR <- 2016
# data$MERGED2017_18_PP.csv$YEAR <- 2017
Then we wrote each csv with the updated year column to files, where they were then merged in the command line.
# write_csv(data$MERGED1996_97_PP.csv, "college_scorecard1996.csv")
# write_csv(data$MERGED1997_98_PP.csv, "college_scorecard1997.csv")
# write_csv(data$MERGED1998_99_PP.csv, "college_scorecard1998.csv")
# write_csv(data$MERGED1999_00_PP.csv, "college_scorecard1999.csv")
# write_csv(data$MERGED2000_01_PP.csv, "college_scorecard2000.csv")
# write_csv(data$MERGED2001_02_PP.csv, "college_scorecard2001.csv")
# write_csv(data$MERGED2002_03_PP.csv, "college_scorecard2002.csv")
# write_csv(data$MERGED2003_04_PP.csv, "college_scorecard2003.csv")
# write_csv(data$MERGED2004_05_PP.csv, "college_scorecard2004.csv")
# write_csv(data$MERGED2005_06_PP.csv, "college_scorecard2005.csv")
# write_csv(data$MERGED2006_07_PP.csv, "college_scorecard2006.csv")
# write_csv(data$MERGED2007_08_PP.csv, "college_scorecard2007.csv")
# write_csv(data$MERGED2008_09_PP.csv, "college_scorecard2008.csv")
# write_csv(data$MERGED2009_10_PP.csv, "college_scorecard2009.csv")
# write_csv(data$MERGED2010_11_PP.csv, "college_scorecard2010.csv")
# write_csv(data$MERGED2011_12_PP.csv, "college_scorecard2011.csv")
# write_csv(data$MERGED2012_13_PP.csv, "college_scorecard2012.csv")
# write_csv(data$MERGED2013_14_PP.csv, "college_scorecard2013.csv")
# write_csv(data$MERGED2014_15_PP.csv, "college_scorecard2014.csv")
# write_csv(data$MERGED2015_16_PP.csv, "college_scorecard2015.csv")
# write_csv(data$MERGED2016_17_PP.csv, "college_scorecard2016.csv")
# write_csv(data$MERGED2017_18_PP.csv, "college_scorecard2017.csv")
# Command used to merge data:
#awk '(NR == 1) || (FNR > 1)' *.csv > CollegeData.csv
The file was read into R, and then used select() to subset only the columns of interest. We then wrote that to “merged.csv” so we don’t have to load the entire dataset into R every time.
#collegeDataSubset <- read_csv("CollegeData.csv")
#collegeDataSubset <- select(CollegeData.csv, YEAR, C150_4, AVGFACSAL, COSTT4_A, COSTT4_P, TUITIONFEE_IN, TUITIONFEE_OUT, TUITIONFEE_PROG, TUITFTE, INEXPFTE, PCTPELL, PCTFLOAN, UG25ABV, UNITID, INSTNM, CITY, STABBR, ZIP, NUMBRANCH, HIGHDEG, CONTROL, LOCALE2, LATITUDE, LONGITUDE, ADM_RATE, ADM_RATE_ALL, SAT_AVG, ACTCMMID, UGDS, UG, UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN, NPT4_PUB, NPT4_PRIV, PAR_ED_PCT_1STGEN, PAR_ED_PCT_MS, PAR_ED_PCT_HS, PAR_ED_PCT_PS, DEP_INC_AVG, DEBT_MDN, GRAD_DEBT_MDN, WDRAW_DEBT_MDN, LO_INC_DEBT_MDN, MD_INC_DEBT_MDN, HI_INC_DEBT_MDN, FEMALE, FAMINC, MD_FAMINC, MN_EARN_WNE_P10, MD_EARN_WNE_P10)
#write_csv(collegeDataSubset, "merged.csv")
Read in new data as college_scorecard and set the values “NULL” and “Privacy Suppressed” to NA. Also set all numeric variables to type “numeric”.
# read in new data
college_scorecard <- read_csv("merged.csv")
#str(college_scorecard)
# set all NULL, PrivacySuppressed to NA
college_scorecard[college_scorecard == "NULL"] <- NA
college_scorecard[college_scorecard == "PrivacySuppressed"] <- NA
# change data types:
college_scorecard$YEAR <- as.numeric(college_scorecard$YEAR)
college_scorecard$C150_4 <- as.numeric(college_scorecard$C150_4)
college_scorecard$AVGFACSAL <- as.numeric(college_scorecard$AVGFACSAL)
college_scorecard$COSTT4_A <- as.numeric(college_scorecard$COSTT4_A)
college_scorecard$COSTT4_P <- as.numeric(college_scorecard$COSTT4_P)
college_scorecard$TUITIONFEE_IN <- as.numeric(college_scorecard$TUITIONFEE_IN)
college_scorecard$TUITIONFEE_OUT <- as.numeric(college_scorecard$TUITIONFEE_OUT)
college_scorecard$TUITIONFEE_PROG <- as.numeric(college_scorecard$TUITIONFEE_PROG)
college_scorecard$TUITFTE <- as.numeric(college_scorecard$TUITFTE)
college_scorecard$INEXPFTE <- as.numeric(college_scorecard$INEXPFTE)
college_scorecard$PCTPELL <- as.numeric(college_scorecard$PCTPELL)
college_scorecard$PCTFLOAN <- as.numeric(college_scorecard$PCTFLOAN)
college_scorecard$UG25ABV <- as.numeric(college_scorecard$UG25ABV)
college_scorecard$LATITUDE <- as.numeric(college_scorecard$LATITUDE)
college_scorecard$LONGITUDE <- as.numeric(college_scorecard$LONGITUDE)
college_scorecard$ADM_RATE <- as.numeric(college_scorecard$ADM_RATE)
college_scorecard$ADM_RATE_ALL <- as.numeric(college_scorecard$ADM_RATE_ALL)
college_scorecard$SAT_AVG <- as.numeric(college_scorecard$SAT_AVG)
college_scorecard$ACTCMMID <- as.numeric(college_scorecard$ACTCMMID)
college_scorecard$UG <- as.numeric(college_scorecard$UG)
college_scorecard$UGDS_WHITE <- as.numeric(college_scorecard$UGDS_WHITE)
college_scorecard$UGDS_BLACK <- as.numeric(college_scorecard$UGDS_BLACK)
college_scorecard$UGDS_HISP <- as.numeric(college_scorecard$UGDS_HISP)
college_scorecard$UGDS_ASIAN <- as.numeric(college_scorecard$UGDS_ASIAN)
college_scorecard$NPT4_PUB <- as.numeric(college_scorecard$NPT4_PUB)
college_scorecard$NPT4_PRIV <- as.numeric(college_scorecard$NPT4_PRIV)
college_scorecard$PAR_ED_PCT_1STGEN <- as.numeric(college_scorecard$PAR_ED_PCT_1STGEN)
college_scorecard$PAR_ED_PCT_MS <- as.numeric(college_scorecard$PAR_ED_PCT_MS)
college_scorecard$PAR_ED_PCT_HS <- as.numeric(college_scorecard$PAR_ED_PCT_HS)
college_scorecard$PAR_ED_PCT_PS <- as.numeric(college_scorecard$PAR_ED_PCT_PS)
college_scorecard$DEP_INC_AVG <- as.numeric(college_scorecard$DEP_INC_AVG)
college_scorecard$DEBT_MDN <- as.numeric(college_scorecard$DEBT_MDN)
college_scorecard$GRAD_DEBT_MDN <- as.numeric(college_scorecard$GRAD_DEBT_MDN)
college_scorecard$WDRAW_DEBT_MDN <- as.numeric(college_scorecard$WDRAW_DEBT_MDN)
college_scorecard$LO_INC_DEBT_MDN <- as.numeric(college_scorecard$LO_INC_DEBT_MDN)
college_scorecard$MD_INC_DEBT_MDN <- as.numeric(college_scorecard$MD_INC_DEBT_MDN)
college_scorecard$HI_INC_DEBT_MDN <- as.numeric(college_scorecard$HI_INC_DEBT_MDN)
college_scorecard$FEMALE <- as.numeric(college_scorecard$FEMALE)
college_scorecard$FAMINC <- as.numeric(college_scorecard$FAMINC)
college_scorecard$MD_FAMINC <- as.numeric(college_scorecard$MD_FAMINC)
college_scorecard$MN_EARN_WNE_P10 <- as.numeric(college_scorecard$MN_EARN_WNE_P10)
college_scorecard$MD_EARN_WNE_P10 <- as.numeric(college_scorecard$MD_EARN_WNE_P10)
There is definitely significant missing data, with roughly 71% of graduation rate missing of roughly 154,000 rows.
Here is total missing data for variables with above average missing data.
sum.NA.total <- college_scorecard %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.))))
SumNA <- sum.NA.total %>% map_dbl(sum)
test <- SumNA[SumNA > rowMeans(sum.NA.total)]
kable(test, format = "markdown")
| x | |
|---|---|
| C150_4 | 109662 |
| COSTT4_A | 118589 |
| COSTT4_P | 132332 |
| TUITIONFEE_IN | 79394 |
| TUITIONFEE_OUT | 81437 |
| TUITIONFEE_PROG | 112990 |
| PCTPELL | 86847 |
| PCTFLOAN | 93269 |
| UG25ABV | 88503 |
| LOCALE2 | 154282 |
| LATITUDE | 147614 |
| LONGITUDE | 147614 |
| ADM_RATE | 111699 |
| ADM_RATE_ALL | 107644 |
| SAT_AVG | 130475 |
| ACTCMMID | 133351 |
| UG | 148139 |
| UGDS_WHITE | 86458 |
| UGDS_BLACK | 86458 |
| UGDS_HISP | 86458 |
| UGDS_ASIAN | 86458 |
| NPT4_PUB | 137528 |
| NPT4_PRIV | 118532 |
| MD_INC_DEBT_MDN | 74465 |
| HI_INC_DEBT_MDN | 75010 |
| MN_EARN_WNE_P10 | 119439 |
| MD_EARN_WNE_P10 | 119439 |
Here is the % of missing data for the column and only variables with above average missing data.
PropOfNA.total <- (sum.NA.total/nrow(college_scorecard))*100
SumNA.prop <- PropOfNA.total %>% map_dbl(sum)
prop <- SumNA.prop[SumNA.prop > rowMeans(PropOfNA.total)]
kable(prop, format = "markdown")
| x | |
|---|---|
| C150_4 | 71.07893 |
| COSTT4_A | 76.86509 |
| COSTT4_P | 85.77281 |
| TUITIONFEE_IN | 51.46031 |
| TUITIONFEE_OUT | 52.78451 |
| TUITIONFEE_PROG | 73.23602 |
| PCTPELL | 56.29108 |
| PCTFLOAN | 60.45358 |
| UG25ABV | 57.36444 |
| LOCALE2 | 100.00000 |
| LATITUDE | 95.67804 |
| LONGITUDE | 95.67804 |
| ADM_RATE | 72.39924 |
| ADM_RATE_ALL | 69.77094 |
| SAT_AVG | 84.56917 |
| ACTCMMID | 86.43328 |
| UG | 96.01833 |
| UGDS_WHITE | 56.03894 |
| UGDS_BLACK | 56.03894 |
| UGDS_HISP | 56.03894 |
| UGDS_ASIAN | 56.03894 |
| NPT4_PUB | 89.14066 |
| NPT4_PRIV | 76.82815 |
| MD_INC_DEBT_MDN | 48.26551 |
| HI_INC_DEBT_MDN | 48.61876 |
| MN_EARN_WNE_P10 | 77.41603 |
| MD_EARN_WNE_P10 | 77.41603 |
We found that graduation rate ranges from 0.00% to 100%. Also the mean is equal to the median, indicating that there is not significant skew in the data.
summary(college_scorecard$C150_4)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.34 0.48 0.48 0.63 1.00 109662
Number of observations with graduation rate equal to zero:
zero <- select(college_scorecard, C150_4, YEAR, INSTNM) %>% dplyr::filter(C150_4 == 0) %>% count()
kable(zero)
| n |
|---|
| 651 |
Number of observations with graduation rate equal to one:
one <- select(college_scorecard, C150_4, YEAR, INSTNM) %>% dplyr::filter(C150_4 == 1) %>% count()
kable(one)
| n |
|---|
| 475 |
Here we have schools with graduation rate equal to zero, ordered by number of undergrads
query <- select(college_scorecard, INSTNM, YEAR, C150_4, UG) %>% dplyr::filter(C150_4 == 0) %>% arrange(UG)
top10.query <- head(query, 10)
kable(top10.query, format = 'markdown')
| INSTNM | YEAR | C150_4 | UG |
|---|---|---|---|
| Mesivtha Tifereth Jerusalem of America | 2000 | 0 | 35 |
| Magnolia Bible College | 2000 | 0 | 37 |
| Hellenic College-Holy Cross Greek Orthodox School of Theology | 2000 | 0 | 47 |
| LIU Brentwood | 2000 | 0 | 94 |
| Midwestern Baptist Theological Seminary | 2000 | 0 | 110 |
| Trinity College of Florida | 2000 | 0 | 147 |
| Burlington College | 2000 | 0 | 206 |
| Notre Dame College | 2000 | 0 | 215 |
| National American University-Roseville | 2000 | 0 | 586 |
| STRAYER UNIVERSITY-LOUDOUN CAMPUS | 2000 | 0 | 861 |
Here we have schools with graduation rate equal to one, ordered by number of undergrads
#schools with grad rate == 1, ordered by # undergrads
query2 <- select(college_scorecard, INSTNM, YEAR, C150_4, UG) %>% dplyr::filter(C150_4 == 1) %>% arrange(UG)
top10.query2 <- head(query2, 10)
kable(top10.query2, format = 'markdown')
| INSTNM | YEAR | C150_4 | UG |
|---|---|---|---|
| BANGOR THEOLOGICAL SEMINARY | 2000 | 1 | 0 |
| Saint Lukes Hospital Medical Technology Program | 2000 | 1 | 6 |
| Mid-America College of Funeral Service | 2000 | 1 | 36 |
| Longy School of Music of Bard College | 2000 | 1 | 37 |
| Winebrenner Theological Seminary | 2000 | 1 | 54 |
| Holy Apostles College and Seminary | 2000 | 1 | 68 |
| Heritage Christian University | 2000 | 1 | 95 |
| MASSACHUSETTS COLLEGE OF PHARMACY AND HEALTH SCIEN | 2000 | 1 | 102 |
| John Wesley University | 2000 | 1 | 154 |
| Jewish Theological Seminary of America | 2000 | 1 | 180 |
We decided to look up some of these schools with graduation rates equal to zero. Some of the schools were new, some are just very small, and some we could find no information on. We decided to leave out rows where the graduation rate was equal to zero. It doesn’t make much sense for a school with 861 undergrades to have a graduation rate of zero for a particular year.
We removed rows with graduation rate equal to zero and rows with NA graduation rate:
# To leave out rows with grad rate == 0 or NA
college1 <- college_scorecard %>% dplyr::filter(college_scorecard$C150_4 != 0)
college <- college1 %>% filter(!is.na(college1$C150_4))
Over time we can see there has been more data collected / school participation.
query3 <- select(college, C150_4, YEAR) %>% group_by(YEAR) %>% count()
grad.rate.bar <- ggplot(query3, aes(YEAR, n)) + geom_bar(stat="identity", width = 0.5, fill="tomato2") + labs(title = "Number of Observations by Year", y = "Count", x = "Year") + theme_classic()
grad.rate.bar
There has been a steady graduation rate over time.
query3.1 <- select(college, C150_4, YEAR) %>% group_by(YEAR) %>% summarise(average = mean(C150_4))
grad.rate.by.year <- ggplot(query3.1, aes(YEAR, average)) + geom_bar(stat="identity", width = 0.5, fill="tomato2") + labs(title = "Average Graduation Rate by Year", y = "Average Graduation Rate", x = "Year") + theme_classic()
grad.rate.by.year
The distribution of graduation rate is approximately normal with a mean of 0.49 and standard deviation of 0.21.
grad.rate.hist <- ggplot(college, aes(x = C150_4)) + geom_histogram(color="darkblue", fill="lightblue") + labs(title = "Histogram of Graduation Rate",x = "Graduation Rate") + theme_classic()
grad.rate.hist
A caption
An interactive map: http://127.0.0.1:12642/custom/googleVis/GeoChartID99057bbae7ce.html
Here we have the bottom ten lowest average graduation rate of our sample.
bottom10 <- df %>% arrange(data.Average)
colnames(bottom10)[colnames(bottom10) == "data.State"] <- "State"
colnames(bottom10)[colnames(bottom10) == "data.Average"] <- "Average Graduation Rate"
kable(bottom10[1:10,], format = "markdown")
| State | Average Graduation Rate |
|---|---|
| American Samoa | 0.27 |
| Nevada | 0.27 |
| Virgin Islands | 0.28 |
| Alaska | 0.31 |
| Oklahoma | 0.35 |
| Georgia | 0.36 |
| Hawaii | 0.36 |
| Puerto Rico | 0.36 |
| North Dakota | 0.37 |
| New Mexico | 0.37 |
Here we have the top ten highest average graduation rate of our sample.
top10 <- df %>% arrange(desc(data.Average))
colnames(top10)[colnames(top10) == "data.State"] <- "State"
colnames(top10)[colnames(top10) == "data.Average"] <- "Average Graduation Rate"
kable(top10[1:10,], format = "markdown")
| State | Average Graduation Rate |
|---|---|
| Rhode Island | 0.67 |
| Massachusetts | 0.61 |
| Iowa | 0.58 |
| Pennsylvania | 0.58 |
| Connecticut | 0.57 |
| Illinois | 0.54 |
| New Hampshire | 0.54 |
| Wyoming | 0.54 |
| California | 0.53 |
| Maryland | 0.53 |
# set seed:
set.seed(1234)
# create an ID for each row:
college <- college %>% mutate(id = row_number())
#Create training set
college.train <- college %>% sample_frac(.70)
#Create test set
college.test <- anti_join(college, college.train, by = 'id')
# took numeric columns and exported to CSV for python file
numeric.columns <- dplyr::select_if(college.train, is.numeric)
#write_csv(numeric.columns, "numeric.columns.csv")
# compute cor from all numeric columns:
cor.matrix <- round(cor(numeric.columns, use = "pairwise"), 2)
# melt the cor matrix:
cor.matrix.melt <- melt(cor.matrix)
# plot cor matrix
ggplot(data = cor.matrix.melt, aes(x=X1, y=X2, fill=value)) +
geom_tile() + theme(axis.text.x = element_text(angle=65, vjust=0.6))
All variables with absolute value of correlation greater than 0.50:
# get corr of grad rate vs everything else:
grad.rate.r <- data.frame(cor.matrix[2,])
colnames(grad.rate.r)[colnames(grad.rate.r) == "cor.matrix.2..."] <- "Correlation"
# subset data to return all varaibles with cor > 0.50:
grad.rate.r.sub <- subset(grad.rate.r, abs(Correlation) > 0.50)
kable(grad.rate.r.sub, format = 'markdown')
| Correlation | |
|---|---|
| C150_4 | 1.00 |
| COSTT4_A | 0.54 |
| TUITIONFEE_OUT | 0.55 |
| SAT_AVG | 0.78 |
| ACTCMMID | 0.77 |
| NPT4_PUB | 0.54 |
| PAR_ED_PCT_1STGEN | -0.58 |
| PAR_ED_PCT_HS | -0.56 |
| PAR_ED_PCT_PS | 0.58 |
| DEP_INC_AVG | 0.57 |
| FAMINC | 0.59 |
| MD_FAMINC | 0.57 |
Now we took a subset of the data with only those variables above.
# subset data to return all columns with cor < 0.50
college.train.highR <- select(college.train, C150_4, COSTT4_A, TUITIONFEE_OUT, SAT_AVG, NPT4_PUB, PAR_ED_PCT_1STGEN, PAR_ED_PCT_HS, PAR_ED_PCT_PS, DEP_INC_AVG, FAMINC)
Looped through to compute scatter plots vs graduation rate.
# function to plot all variables of interest vs graduation rate:
plotScatFunc <- function(x, na.rm = TRUE, ...) {
nm <- names(x)
for (i in seq_along(nm)) {
print(ggplot(x, aes_string(x = nm[i], y= nm[1])) + geom_point(alpha = 0.3, fill = "blue") +
geom_smooth(method = "lm")) + theme_classic()
}
}
plotScatFunc(college.train.highR)
We started by grabbing only complete cases:
# test:
college.test.highR <- select(college.test, C150_4, COSTT4_A, TUITIONFEE_OUT, SAT_AVG, NPT4_PUB, PAR_ED_PCT_1STGEN, PAR_ED_PCT_HS, PAR_ED_PCT_PS, DEP_INC_AVG, FAMINC)
college.test.highR.com <- college.test.highR %>% filter(complete.cases(.))
# train:
college.train.highR.com <- college.train.highR %>% filter(complete.cases(.))
Now to confirm train data with complete cases is consistent with the data:
# GGPLOT hist of grad rate for train data
grad.rate.train.hist <- ggplot(college.train.highR.com, aes(x = C150_4)) + geom_histogram(color="darkblue", fill="lightblue") + labs(title = "Train Data",x = "Graduation Rate") + theme_classic()
college.train.highR.year <- select(college.train, YEAR, C150_4, COSTT4_A, TUITIONFEE_OUT, SAT_AVG, NPT4_PUB, PAR_ED_PCT_1STGEN, PAR_ED_PCT_HS, PAR_ED_PCT_PS, DEP_INC_AVG, FAMINC)
college.train.highR.year.com <- college.train.highR.year %>% filter(complete.cases(.))
# GGPLOT bar of grad rate by year train data
query5 <- select(college.train.highR.year.com, C150_4, YEAR) %>% group_by(YEAR) %>% count()
grad.rate.train.bar <- ggplot(query5, aes(YEAR, n)) + geom_bar(stat="identity", width = 0.5, fill="tomato2") +
labs(title = "Train Data", y = "Count", x = "Year") + theme_classic()
The distribution of graduation rate is still approximately normal with mean 0.49.
grid.arrange(grad.rate.hist, grad.rate.train.hist, ncol=2)
However, we now only have years in the range 2009 to 2016. But this should be fine because graduation rate has stayed consistent over time.
summary(college.train.highR.year.com$YEAR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2009 2011 2013 2013 2014 2016
grid.arrange(grad.rate.bar, grad.rate.train.bar, ncol=2)
We now have 2682 observations for the train data and 1165 observations for the test data.
# check # of complete cases:
sum(complete.cases(college.train.highR))
## [1] 2682
sum(complete.cases(college.test.highR.com))
## [1] 1165
Our first regression uses all the explanatory variables.
| variables | variable_desc |
|---|---|
| COSTT4_A | Average cost of attendance |
| SAT_AVG | Average SAT score |
| TUITIONFEE_OUT | Out-of-state tuition and fees |
| NPT4_PUB | Average net price for Title IV institutions |
| PAR_ED_PCT_1STGEN | % first-generation students |
| PAR_ED_PCT_HS | % of students whose parents’ highest educ is high school |
| PAR_ED_PCT_PS | % of students whose parents’ highest educ level - post secondary |
| FAMINC | Average family income in real 2015 dollars |
# Model 1
fit1 <- lm(C150_4 ~ COSTT4_A + + SAT_AVG + TUITIONFEE_OUT + NPT4_PUB + PAR_ED_PCT_1STGEN + PAR_ED_PCT_HS + PAR_ED_PCT_PS + FAMINC, data = college.train.highR.com)
summary(fit1)
##
## Call:
## lm(formula = C150_4 ~ COSTT4_A + +SAT_AVG + TUITIONFEE_OUT +
## NPT4_PUB + PAR_ED_PCT_1STGEN + PAR_ED_PCT_HS + PAR_ED_PCT_PS +
## FAMINC, data = college.train.highR.com)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29502 -0.05157 0.00418 0.05681 0.41721
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.813e-01 2.985e-02 -16.124 < 2e-16 ***
## COSTT4_A 9.988e-06 9.503e-07 10.510 < 2e-16 ***
## SAT_AVG 6.845e-04 2.080e-05 32.900 < 2e-16 ***
## TUITIONFEE_OUT 5.411e-07 4.206e-07 1.287 0.198326
## NPT4_PUB -3.553e-06 9.253e-07 -3.840 0.000126 ***
## PAR_ED_PCT_1STGEN 8.284e-01 8.087e-02 10.243 < 2e-16 ***
## PAR_ED_PCT_HS -1.099e+00 9.384e-02 -11.708 < 2e-16 ***
## PAR_ED_PCT_PS NA NA NA NA
## FAMINC 2.985e-06 1.601e-07 18.646 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0803 on 2674 degrees of freedom
## Multiple R-squared: 0.7636, Adjusted R-squared: 0.7629
## F-statistic: 1234 on 7 and 2674 DF, p-value: < 2.2e-16
We found that all variables were significant except for TUITIONFEE_OUT and PAR_ED_PCT_PS. We also have a strong R^2 of 0.76.
We ran the regression again but without TUITIONFEE_OUT and PAR_ED_PCT_PS.
# Model 2
fit2 <- lm(C150_4 ~ COSTT4_A + + SAT_AVG + TUITIONFEE_OUT + PAR_ED_PCT_1STGEN + PAR_ED_PCT_HS + FAMINC, data = college.train.highR.com)
summary(fit2)
##
## Call:
## lm(formula = C150_4 ~ COSTT4_A + +SAT_AVG + TUITIONFEE_OUT +
## PAR_ED_PCT_1STGEN + PAR_ED_PCT_HS + FAMINC, data = college.train.highR.com)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27848 -0.05125 0.00422 0.05692 0.42342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.773e-01 2.991e-02 -15.959 < 2e-16 ***
## COSTT4_A 7.032e-06 5.585e-07 12.590 < 2e-16 ***
## SAT_AVG 6.949e-04 2.068e-05 33.609 < 2e-16 ***
## TUITIONFEE_OUT 1.135e-06 3.921e-07 2.895 0.00382 **
## PAR_ED_PCT_1STGEN 9.380e-01 7.586e-02 12.365 < 2e-16 ***
## PAR_ED_PCT_HS -1.221e+00 8.849e-02 -13.799 < 2e-16 ***
## FAMINC 2.739e-06 1.471e-07 18.624 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08051 on 2675 degrees of freedom
## Multiple R-squared: 0.7623, Adjusted R-squared: 0.7617
## F-statistic: 1429 on 6 and 2675 DF, p-value: < 2.2e-16
This time all explanatory variables are significant and our R^2 remained strong at 0.76.
To confirm normality of the residuals:
# check normality of residuals
plot(fit2$residuals, ylab = "residuals")
hist(fit2$residuals, main = "Histogram of Residuals", xlab = "Residuals")
qqnorm(fit2$residuals)
Now to predict graduation rate using our model. The resulting RMSE was 0.08.
# predict grad rate with test data:
predictions <- fit2 %>% predict(college.test.highR.com)
# compute R2, RMSE, and MAE from 2 models:
results <- data.frame(R2 = R2(predictions, college.test.highR.com$C150_4),
RMSE = RMSE(predictions, college.test.highR.com$C150_4),
MAE = MAE(predictions, college.test.highR.com$C150_4))
kable(results, format = "markdown")
| R2 | RMSE | MAE |
|---|---|---|
| 0.7542626 | 0.0809669 | 0.0638532 |
# plot of grad rate vs predicted grad rate:
plot(college.test.highR.com$C150_4, predictions, xlim=c(0.2,1), ylim=c(0.2,1), main = "Plot of Predicted Grad Rates vs Actual", xlab=("Graduation Rate"), ylab=("Predicted Graduation Rate"))
abline(coef=c(0,1), col = "red")
It can be concluded that factors including average cost of attendance, tuition and fees for out of state students, an institution’s percentage of first generation students, an institution’s percentage of students who’s parents highest education is high school, and average family income are statistically significant in relation to graduation rates. The regression assumption of independence and normal distribution of the error term was confirmed in the residual plots as well as normality of the response variable graduation rate in the histogram. Additionally, the relatively small RMSE (0.07 < 0.5) resulting from the cross validation method suggests that the model will be respectively accurate in practice for predicting graduation rates utilizing the reported explanatory variables. It is notable to state that there is a lack of data on students who do not receive Title IV aid that may result in somewhat biased estimates of collective student outcomes at institutions with low proportions of Title IV aided students as well as the observations regressed on only span from 2009-2016. However, relative to other publicly available data sources that have their own limitations, the data is still likely to be a significant improvement in providing insight for conditions shaping the graduation rates at various institutions.